Results 1 - 13 of 13
1.
Gene X ; 5: 100035, 2020 Dec.
Article in English | MEDLINE | ID: mdl-32550561

ABSTRACT

BACKGROUND: The accurate identification of the exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict the SS is useful for functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, improving their accuracy is required for reliable annotation. Moreover, models are often derived and tested using the same genome, providing no evidence of broad application, i.e. to other poorly studied genomes. RESULTS: With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on the ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. Results demonstrate that the models efficiently detect SS in other organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that models correctly identify conserved genomic elements enabling annotation of SS in new genomes by choosing the taxonomically closest model. CONCLUSIONS: The results of our study demonstrated that Splice2Deep both achieved a considerably reduced error rate compared to other state-of-the-art models and accurately recognized SS in other organisms for which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes.
Splice2Deep models are implemented in Python using the Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.
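
The abstract does not describe how sequences are presented to the networks. As a rough illustration only, the sketch below shows one common way to prepare input for a convolutional model: locate the canonical donor (GT) and acceptor (AG) dinucleotides, cut a fixed window around each, and one-hot encode it. The function names, the flank size, and the example sequence are assumptions made for this sketch and are not taken from Splice2Deep.

```python
ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as rows of 0/1 in A, C, G, T order.
    Any other character (e.g. N) becomes an all-zero row."""
    return [[1 if base == letter else 0 for letter in ALPHABET]
            for base in seq.upper()]

def candidate_sites(seq, dinucleotide, flank=3):
    """Yield (position, window) for each occurrence of `dinucleotide`
    that has `flank` bases available on both sides.
    `flank=3` is an arbitrary illustrative value."""
    seq = seq.upper()
    start = seq.find(dinucleotide)
    while start != -1:
        lo, hi = start - flank, start + len(dinucleotide) + flank
        if lo >= 0 and hi <= len(seq):
            yield start, seq[lo:hi]
        start = seq.find(dinucleotide, start + 1)

dna = "CCAGGTAAGTCCTTAGGCA"                    # made-up sequence
donors = list(candidate_sites(dna, "GT"))      # candidate donor sites
acceptors = list(candidate_sites(dna, "AG"))   # candidate acceptor sites
encoded = [one_hot(window) for _, window in donors]
```

Every window produced this way is only a candidate; deciding which candidates are true splice sites is the classification task the abstract addresses.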

2.
Gene ; 763S: 100035, 2020 Dec.
Article in English | MEDLINE | ID: mdl-34493371

ABSTRACT

BACKGROUND: The accurate identification of the exon/intron boundaries is critical for the correct annotation of genes with multiple exons. Donor and acceptor splice sites (SS) demarcate these boundaries. Therefore, deriving accurate computational models to predict the SS is useful for functional annotation of genes and genomes, and for finding alternative SS associated with different diseases. Although various models have been proposed for the in silico prediction of SS, improving their accuracy is required for reliable annotation. Moreover, models are often derived and tested using the same genome, providing no evidence of broad application, i.e. to other poorly studied genomes. RESULTS: With this in mind, we developed the Splice2Deep models for SS detection. Each model is an ensemble of deep convolutional neural networks. We evaluated the performance of the models based on the ability to detect SS in Homo sapiens, Oryza sativa japonica, Arabidopsis thaliana, Drosophila melanogaster, and Caenorhabditis elegans. Results demonstrate that the models efficiently detect SS in other organisms not considered during the training of the models. Compared to the state-of-the-art tools, Splice2Deep models achieved significantly reduced average error rates of 41.97% and 28.51% for acceptor and donor SS, respectively. Moreover, the Splice2Deep cross-organism validation demonstrates that models correctly identify conserved genomic elements enabling annotation of SS in new genomes by choosing the taxonomically closest model. CONCLUSIONS: The results of our study demonstrated that Splice2Deep both achieved a considerably reduced error rate compared to other state-of-the-art models and accurately recognized SS in other organisms for which the model was not trained, enabling annotation of poorly studied or newly sequenced genomes.
Splice2Deep models are implemented in Python using the Keras API; the models and the data are available at https://github.com/SomayahAlbaradei/Splice_Deep.git.


Subjects
Genome/genetics; Genomics; RNA Splice Sites/genetics; Software; Algorithms; Animals; Chromosome Mapping; Computational Biology; DNA/genetics; Drosophila melanogaster/genetics; Exons/genetics; Humans; Introns/genetics; Neural Networks, Computer
3.
Methods ; 166: 31-39, 2019 08 15.
Article in English | MEDLINE | ID: mdl-30991099

ABSTRACT

Polyadenylation signals (PAS) are found in most protein-coding and some non-coding genes in eukaryotes. Their accurate recognition improves understanding of gene regulation mechanisms and recognition of the 3'-end of transcribed gene regions where premature or alternate transcription ends may lead to various diseases. Although different methods and tools for in-silico prediction of genomic signals have been proposed, the correct identification of PAS in genomic DNA remains challenging due to a vast number of non-relevant hexamers identical to PAS hexamers. In this study, we developed a novel method for PAS recognition. The method is implemented in a hybrid PAS recognition model (HybPAS), which is based on deep neural networks (DNNs) and logistic regression models (LRMs). One such model is developed for each of the 12 most frequent human PAS hexamers. DNN models performed best for eight PAS types (including the two most frequent PAS hexamers), while LRMs performed best for the remaining four PAS types. The new models use different combinations of signal processing-based, statistical, and sequence-based features as input. The results obtained on human genomic data show that HybPAS outperforms the well-tuned state-of-the-art Omni-PolyA models, reducing the classification error for different PAS hexamers by up to 57.35% for 10 out of 12 PAS types, with Omni-PolyA models being better for two PAS types. For the most frequent PAS types, 'AATAAA' and 'ATTAAA', HybPAS reduced the error rate by 35.14% and 34.48%, respectively. On average, HybPAS reduces the error by 30.29%. HybPAS is implemented partly in Python and partly in MATLAB and is available at https://github.com/EMANG-KAUST/PolyA_Prediction_LRM_DNN.


Subjects
Genome, Human/genetics; Genomics/methods; Neural Networks, Computer; Software; Humans; Poly A/genetics; Polyadenylation/genetics; Proteins/genetics
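
To make concrete why hexamer matching alone is not enough for PAS recognition, the sketch below simply lists every exact occurrence of the two most frequent PAS hexamers named in the abstract. It is not part of HybPAS; the function name and the example sequence are assumptions for illustration.

```python
# The two most frequent human PAS hexamers named in the abstract.
PAS_HEXAMERS = ("AATAAA", "ATTAAA")

def find_pas_candidates(seq, hexamers=PAS_HEXAMERS):
    """Return sorted (position, hexamer) pairs for every exact match.
    Overlapping matches are reported as well."""
    seq = seq.upper()
    hits = []
    for hexamer in hexamers:
        start = seq.find(hexamer)
        while start != -1:
            hits.append((start, hexamer))
            start = seq.find(hexamer, start + 1)
    return sorted(hits)

dna = "GGAATAAACCATTAAAGGAATAAAT"   # made-up sequence
candidates = find_pas_candidates(dna)
```

Each hit is only a candidate: the same hexamer can occur at positions that are not functional polyadenylation signals, which is why a trained classifier is applied to the sequence context around each match.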
4.
Nucleic Acids Res ; 46(12): e72, 2018 07 06.
Article in English | MEDLINE | ID: mdl-29617876

ABSTRACT

Identifying transcription factor (TF) binding sites (TFBSs) is important in the computational inference of gene regulation. Widely used computational methods of TFBS prediction based on position weight matrices (PWMs) usually have high false positive rates. Moreover, computational studies of transcription regulation in eukaryotes frequently require numerous PWM models of TFBSs due to a large number of TFs involved. To overcome these problems, we developed DRAF, a novel method for TFBS prediction that requires only 14 prediction models for 232 human TFs while significantly improving prediction accuracy. DRAF models use more features than PWM models, as they combine information from TFBS sequences and physicochemical properties of TF DNA-binding domains into machine learning models. Evaluation of DRAF on 98 human ChIP-seq datasets shows on average a 1.54-, 1.96- and 5.19-fold reduction of false positives at the same sensitivities compared to models from HOCOMOCO, TRANSFAC and DeepBind, respectively. This observation suggests that one can efficiently replace the PWM models for TFBS prediction by a small number of DRAF models that significantly improve prediction accuracy. The DRAF method is implemented in a web tool and as stand-alone software freely available at http://cbrc.kaust.edu.sa/DRAF.


Subjects
Sequence Analysis, DNA/methods; Transcription Factors/metabolism; Binding Sites; Chromatin Immunoprecipitation; DNA/chemistry; DNA/metabolism; Humans; Machine Learning; Position-Specific Scoring Matrices
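
For readers unfamiliar with the PWM baseline that the abstract compares against, the following is a minimal sketch of log-odds PWM scanning. It is not DRAF and not any of the cited tools; the count matrix, the pseudocount, the uniform background, and the thresholds are all assumptions chosen for illustration.

```python
import math

BASES = "ACGT"

# Hypothetical count matrix for a 4-bp motif (one dict per motif position).
COUNTS = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 10, "C": 0, "G": 0, "T": 0},
]

def log_odds_pwm(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log2-odds scores
    against a uniform background (an assumption of this sketch)."""
    pwm = []
    for column in counts:
        total = sum(column.values()) + len(BASES) * pseudocount
        pwm.append({
            base: math.log2((column[base] + pseudocount) / total / background)
            for base in BASES
        })
    return pwm

def scan(seq, pwm, threshold):
    """Return (position, window, score) for windows scoring >= threshold."""
    seq = seq.upper()
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if any(base not in BASES for base in window):
            continue  # skip windows containing N or other ambiguity codes
        score = sum(pwm[j][base] for j, base in enumerate(window))
        if score >= threshold:
            hits.append((i, window, round(score, 2)))
    return hits

pwm = log_odds_pwm(COUNTS)
strict = scan("GGTATAGG", pwm, threshold=5.0)   # only the exact TATA match
loose = scan("GGTATAGG", pwm, threshold=-1.0)   # also admits half-matching windows
```

With the strict threshold only the window at position 2 is reported; with the loose threshold the partially matching windows at positions 0 and 4 are reported too. This threshold trade-off is one source of the false positives the abstract mentions.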
5.
BMC Genomics ; 18(1): 33, 2017 01 05.
Article in English | MEDLINE | ID: mdl-28056772

ABSTRACT

BACKGROUND: Finding a source from which high-energy-density biofuels can be derived at an industrial scale has become an urgent challenge for renewable energy production. Some microorganisms can produce free fatty acids (FFA) as precursors towards such high-energy-density biofuels. In particular, photosynthetic cyanobacteria are capable of directly converting carbon dioxide into FFA. However, current engineered strains need several rounds of engineering to reach a commercially viable level of FFA production; thus new chassis strains that require less engineering are needed. Although more than 120 cyanobacterial genomes are sequenced, the natural potential of these strains for FFA production and excretion has not been systematically estimated. RESULTS: Here we present the FFA SC (FFASC), an in silico screening method that evaluates the potential for FFA production and excretion of cyanobacterial strains based on their proteomes. A literature search allowed for the compilation of 64 proteins, most of which influence FFA production and a few of which affect FFA excretion. The proteins are classified into 49 orthologous groups (OGs) that helped create rules used in the scoring/ranking of algorithms developed to estimate the potential for FFA production and excretion of an organism. Among 125 cyanobacterial strains, FFASC identified 20 candidate chassis strains that rank in their FFA producing and excreting potential above the specifically engineered reference strain, Synechococcus sp. PCC 7002. We further show that the top ranked cyanobacterial strains are unicellular and primarily include Prochlorococcus (order Prochlorales) and marine Synechococcus (order Chroococcales) that cluster phylogenetically. Moreover, two principal categories of enzymes were shown to influence FFA production the most: those ensuring precursor availability for the biosynthesis of lipids, and those involved in handling the oxidative stress associated with FFA synthesis.
CONCLUSION: To our knowledge FFASC is the first in silico method to screen cyanobacteria proteomes for their potential to produce and excrete FFA, as well as the first attempt to parameterize the criteria derived from genetic characteristics that are favorable/non-favorable for this purpose. Thus, FFASC helps focus experimental evaluation only on the most promising cyanobacteria.


Subjects
Computational Biology/methods; Cyanobacteria/genetics; Cyanobacteria/metabolism; Fatty Acids, Nonesterified/biosynthesis; Algorithms; Cluster Analysis; Computer Simulation; Cyanobacteria/classification; Metabolic Networks and Pathways; Photosynthesis; Phylogeny; Proteome; Proteomics/methods
6.
Sci Rep ; 5: 11136, 2015 Jun 15.
Article in English | MEDLINE | ID: mdl-26073445

ABSTRACT

Honey bee colonies exhibit an age-related division of labor, with worker bees performing discrete sets of behaviors throughout their lifespan. These behavioral states are associated with distinct brain transcriptomic states, yet little is known about the regulatory mechanisms governing them. We used CAGEscan (a variant of the Cap Analysis of Gene Expression technique) for the first time to characterize the promoter regions of differentially expressed brain genes during two behavioral states (brood care (aka "nursing") and foraging) and identified transcription factors (TFs) that may govern their expression. More than half of the differentially expressed TFs were associated with motifs enriched in the promoter regions of differentially expressed genes (DEGs), suggesting they are regulators of behavioral state. Strikingly, five TFs (nf-kb, egr, pax6, hairy, and clockwork orange) were predicted to co-regulate nearly half of the genes that were upregulated in foragers. Finally, differences in alternative TSS usage between nurses and foragers were detected upstream of 646 genes, whose functional analysis revealed enrichment for Gene Ontology terms associated with neural function and plasticity. This demonstrates for the first time that alternative TSSs are associated with stable differences in behavior, suggesting they may play a role in organizing behavioral state.


Subjects
Bees/genetics; Brain/metabolism; Early Growth Response Transcription Factors/genetics; Gene Expression Regulation, Developmental; Insect Proteins/genetics; Transcription, Genetic; Animals; Basic Helix-Loop-Helix Transcription Factors/genetics; Basic Helix-Loop-Helix Transcription Factors/metabolism; Bees/growth & development; Bees/metabolism; Behavior, Animal; Brain/growth & development; Early Growth Response Transcription Factors/metabolism; Eye Proteins/genetics; Eye Proteins/metabolism; Gene Expression Profiling; Homeodomain Proteins/genetics; Homeodomain Proteins/metabolism; Insect Proteins/metabolism; Multigene Family; NF-kappa B/genetics; NF-kappa B/metabolism; Neuronal Plasticity/genetics; PAX6 Transcription Factor; Paired Box Transcription Factors/genetics; Paired Box Transcription Factors/metabolism; Promoter Regions, Genetic; Repressor Proteins/genetics; Repressor Proteins/metabolism; Signal Transduction
7.
PLoS One ; 9(6): e97338, 2014.
Article in English | MEDLINE | ID: mdl-24921648

ABSTRACT

Metagenomics-based functional profiling analysis is an effective means of gaining deeper insight into the composition of marine microbial populations and developing a better understanding of the interplay between the functional genome content of microbial communities and abiotic factors. Here we present a comprehensive analysis of 24 datasets covering surface and depth-related environments at 11 sites around the world's oceans. The complete dataset comprises approximately 12 million sequences, totaling 5,358 Mb. Based on profiling patterns of Clusters of Orthologous Groups (COGs) of proteins, a core set of reference photic and aphotic depth-related COGs, and a collection of COGs that are associated with extreme oxygen limitation were defined. Their inferred functions were utilized as indicators to characterize the distribution of light- and oxygen-related biological activities in marine environments. The results reveal that, while light level in the water column is a major determinant of phenotypic adaptation in marine microorganisms, oxygen concentration in the aphotic zone has a significant impact only in extremely hypoxic waters. Phylogenetic profiling of the reference photic/aphotic gene sets revealed a greater variety of source organisms in the aphotic zone, although the majority of individual photic and aphotic depth-related COGs are assigned to the same taxa across the different sites. This increase in phylogenetic and functional diversity of the core aphotic related COGs most probably reflects selection for the utilization of a broad range of alternate energy sources in the absence of light.


Subjects
Metagenome; Microbiota/genetics; Seawater/microbiology; Adaptation, Physiological; Light; Microbiota/physiology; Multigene Family; Phylogeny
8.
PLoS One ; 8(7): e68857, 2013.
Article in English | MEDLINE | ID: mdl-23874789

ABSTRACT

BACKGROUND: Initiation of transcription is essential for most of the cellular responses to environmental conditions and for cell and tissue specificity. This process is regulated through numerous proteins, their ligands and mutual interactions, as well as interactions with DNA. The key such regulatory proteins are transcription factors (TFs) and transcription co-factors (TcoFs). TcoFs are important since they modulate the transcription initiation process through interaction with TFs. In eukaryotes, transcription requires that TFs form different protein complexes with various nuclear proteins. To better understand transcription regulation, it is important to know the functional class of proteins interacting with TFs during transcription initiation. Such information is not fully available, since not all proteins that act as TFs or TcoFs are yet annotated as such, due to generally partial functional annotation of proteins. In this study we have developed a method to predict, using only sequence composition of the interacting proteins, the functional class of human TF binding partners to be (i) TF, (ii) TcoF, or (iii) other nuclear protein. This allows for complementing the annotation of the currently known pool of nuclear proteins. Since only the knowledge of protein sequences is required in addition to protein interaction, the method should be easily applicable to many species. RESULTS: Based on experimentally validated interactions between human TFs and different TFs, TcoFs and other nuclear proteins, our two classification systems (implemented as a web-based application) achieve high accuracies in distinguishing TFs and TcoFs from other nuclear proteins, and TFs from TcoFs, respectively.
CONCLUSION: As demonstrated, given the fact that two proteins are capable of forming direct physical interactions and using only information about their sequence composition, we have developed a completely new method for predicting a functional class of TF interacting protein partners with high precision and accuracy.


Subjects
Computational Biology/methods; Multiprotein Complexes/metabolism; Transcription Factors/metabolism; Databases, Protein; Humans; Protein Binding
9.
Bioinformatics ; 29(13): i316-25, 2013 Jul 01.
Article in English | MEDLINE | ID: mdl-23813000

ABSTRACT

MOTIVATION: Polyadenylation is the addition of a poly(A) tail to an RNA molecule. Identifying DNA sequence motifs that signal the addition of poly(A) tails is essential to improved genome annotation and better understanding of the regulatory mechanisms and stability of mRNA. Existing poly(A) motif predictors demonstrate that information extracted from the surrounding nucleotide sequences of candidate poly(A) motifs can differentiate true motifs from the false ones to a great extent. A variety of sophisticated features has been explored, including sequential, structural, statistical, thermodynamic and evolutionary properties. However, most of these methods involve extensive manual feature engineering, which can be time-consuming and can require in-depth domain knowledge. RESULTS: We propose a novel machine-learning method for poly(A) motif prediction by marrying generative learning (hidden Markov models) and discriminative learning (support vector machines). Generative learning provides a rich palette on which the uncertainty and diversity of sequence information can be handled, while discriminative learning allows the performance of the classification task to be directly optimized. Here, we used hidden Markov models for fitting the DNA sequence dynamics, and developed an efficient spectral algorithm for extracting latent variable information from these models. These spectral latent features were then fed into support vector machines to fine-tune the classification performance. We evaluated our proposed method on a comprehensive human poly(A) dataset that consists of 14 740 samples from 12 of the most abundant variants of human poly(A) motifs. Compared with one of the previous state-of-the-art methods in the literature (the random forest model with expert-crafted features), our method reduces the average error rate, false-negative rate and false-positive rate by 26, 15 and 35%, respectively. 
Meanwhile, our method makes ~30% fewer error predictions relative to the other string kernels. Furthermore, our method can be used to visualize the importance of oligomers and positions in predicting poly(A) motifs, from which we can observe a number of characteristics in the surrounding regions of true and false motifs that have not been reported before. AVAILABILITY: http://sfb.kaust.edu.sa/Pages/Software.aspx. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
3' Untranslated Regions; Artificial Intelligence; DNA/chemistry; Poly A/genetics; Sequence Analysis, DNA/methods; Algorithms; Humans; Markov Chains; Nucleotide Motifs; Polyadenylation; Software; Support Vector Machine
11.
Bioinformatics ; 29(1): 117-8, 2013 Jan 01.
Article in English | MEDLINE | ID: mdl-23110968

ABSTRACT

SUMMARY: In higher eukaryotes, the identification of translation initiation sites (TISs) has been focused on finding these signals in cDNA or mRNA sequences. Using Arabidopsis thaliana (A.t.) information, we developed a prediction tool for signals within genomic sequences of plants that correspond to TISs. Our tool requires only genome sequence, not expressed sequences. Its sensitivity/specificity is 90.75%/92.2% for A.t., 66.8%/94.4% for Vitis vinifera and 81.6%/94.4% for Populus trichocarpa, which suggests that our tool can be used in annotation of different plant genomes. We provide a list of features used in our model. Further study of these features may improve our understanding of mechanisms of the translation initiation. AVAILABILITY AND IMPLEMENTATION: Our tool is implemented as an artificial neural network. It is available as a web-based tool and, together with the source code, the list of features, and data used for model development, is accessible at http://cbrc.kaust.edu.sa/dts.


Subjects
Arabidopsis/genetics; Peptide Chain Initiation, Translational; Software; Genome, Plant; Genomics; Internet; Neural Networks, Computer; Nucleotide Motifs; Sensitivity and Specificity; Sequence Analysis, DNA
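
As a reminder of what the sensitivity/specificity pairs in this abstract measure, the sketch below computes both from confusion-matrix counts using the standard definitions. The function name is an assumption and the counts are made up for illustration; they are not taken from the paper.

```python
def sensitivity_specificity(tp, fn, tn, fp):
    """Return (sensitivity, specificity) from confusion-matrix counts.

    sensitivity = TP / (TP + FN): fraction of true sites that are detected.
    specificity = TN / (TN + FP): fraction of non-sites correctly rejected.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)

# Made-up counts for illustration only; not taken from the paper.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=95, fp=5)
print(f"sensitivity={sens:.2%} specificity={spec:.2%}")
```

A predictor can keep specificity high while sensitivity drops, as the abstract reports for Vitis vinifera, because the two rates are computed over different sets of examples.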
12.
Front Genet ; 3: 100, 2012.
Article in English | MEDLINE | ID: mdl-22670148

ABSTRACT

Mutations in any genome may lead to phenotype characteristics that determine the ability of an individual to cope with adaptation to environmental challenges. In studies of human biology, among the most interesting ones are phenotype characteristics that determine responses to drug treatments, response to infections, or predisposition to specific inherited diseases. Most of the research in this field has been focused on the studies of mutation effects on the final gene products, peptides, and their alterations. Considerably less attention was given to the mutations that may affect regulatory mechanism(s) of gene expression, although these may also affect the phenotype characteristics. In this study we make a pilot analysis of mutations observed in the regulatory regions of 24,667 human RefSeq genes. Our study reveals that out of eight studied mutation types, "insertions" are the only one that in a statistically significant manner alters predicted transcription factor binding sites (TFBSs). We also find that 25 families of TFBSs have been altered by mutations in a statistically significant manner in the promoter regions we considered. Moreover, we find that the related transcription factors are, for example, prominent in processes related to intracellular signaling; cell fate; morphogenesis of organs and epithelium; development of urogenital system, epithelium, and tube; neuron fate commitment. Our study highlights the significance of studying mutations within the genes' regulatory regions and opens the way for further detailed investigations on this topic, particularly on the downstream affected pathways.

13.
Bioinformatics ; 28(1): 127-9, 2012 Jan 01.
Article in English | MEDLINE | ID: mdl-22088842

ABSTRACT

MOTIVATION: Recognition of poly(A) signals in mRNA is relatively straightforward due to the presence of an easily recognizable polyadenylic acid tail. However, the task of identifying poly(A) motifs in the primary genomic DNA sequence that correspond to poly(A) signals in mRNA is a far more challenging problem. Recognition of poly(A) signals is important for better gene annotation and understanding of the gene regulation mechanisms. In this work, we present one such poly(A) motif prediction method based on properties of human genomic DNA sequence surrounding a poly(A) motif. These properties include thermodynamic, physico-chemical and statistical characteristics. For predictions, we developed Artificial Neural Network and Random Forest models. These models are trained to recognize the 12 most common poly(A) motifs in human DNA. Our predictors are available as a free web-based tool accessible at http://cbrc.kaust.edu.sa/dps. Compared with other reported predictors, our models achieve higher sensitivity and specificity and furthermore provide a consistent level of accuracy for the 12 poly(A) motif variants. CONTACT: vladimir.bajic@kaust.edu.sa SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.


Subjects
Algorithms; Neural Networks, Computer; Poly A/analysis; Genome, Human; Humans; Internet; Poly A/genetics; Sensitivity and Specificity; Software